Bahman Sheikh, NetID: bahmans2
Derek Zhang, NetID: derekz3
Teju Kandula, NetID: tkandu2
Sports analytics is one of the most demanding topics in statistical modeling these days. Different sport teams invest lots of money on their team each year. They need to evaluate carefully how they choose and who they need in different posts, what playing strategies works better for their team, e.g. offensive, defensive in order to spend less but achieve more. The team managers use these statistical models to determine what characteristics make a college player become a robust professional player as well as many other crucial questions and decisions they need to make each season. Sports analytics are a combination of historical events, statistical analysis, simulations and modeling according to a teams individual players, teams tactics and teams opponents that when properly analyzed and applied in the team can provide a competitive advantage to a team or even the individual players. By analyzing this data, sports analytics inform players and coaches in order to improve and aid in decision making both during, prior and post game. According to Wikipedia the term “sports analytics” was spread in typical sports culture following the release of the 2011 movie Moneyball, the Oakland Athletics team of 2002 created history by winning 20 consecutive games between August 13 and September 20 2002. Much of the Oakland Athletics success in that season is credited to a graduate in Economics from Harvard University joining the team in 1999 and implementing analysis of baseball statistics to assess and purchase players. Sports analytics helps sport teams make better, informed decisions which increases the performance of a sport team through better resource management. On top of that with sports analytics team managers can save or even makes money for the club by wisely investing on the players, finding unknown talent or find weaknesses in another player or a team. Hence these days sports analytics is rapidly becoming an essential element in the success of sports teams.
In this project we decided to enter the interesting and fast-evolving field of data science. Our objective is to statistically investigate the critical features of baseball teams and players, identify the significant ones, and use them to predict the performance of a baseball team and/or an individual player within a stated confidence level. Based on the statistical data, our main intention is to answer the following questions:
Some definitions: a player's batting average is the ratio of his number of hits to his number of at-bats (his official opportunities to hit). Professional baseball players typically have batting averages somewhere between 0.200 and 0.300. A player's batting average in a given year can be predicted from his cumulative history of performance, and it can be considered a critical feature of a professional player. On the other hand, team-level features such as runs allowed, runs scored, on-base percentage, and slugging percentage are the main measures of a baseball team's performance in a season. According to statistical analysis, a baseball team requires 90 to 100 wins in order to make the playoffs, and a team typically needs to score 800 to 820 runs while allowing 600 to 650 runs in order to reach the postseason.
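As a quick concrete check, batting average is just hits over at-bats; the numbers below are Hank Aaron's 1960 line, which also appears in the player dataset preview later in this report:

```r
# batting average = hits / at-bats
# (H = 172, AB = 590: Hank Aaron's 1960 season, as shown in the data preview)
hits    = 172
at_bats = 590
avg = hits / at_bats
round(avg, 3)  # 0.292
```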
In this project we intend to apply the knowledge gained in the course to understand how statistical methods can be used to develop a predictive model in sports analytics. We will use R to first clean and prepare the data, study the distribution of each individual predictor, and identify any possible outliers. We will then investigate how significant our predictors are for predicting a baseball team's wins and a player's batting average. We will use model selection techniques based on AIC and BIC, judged alongside adjusted R-squared and LOOCV RMSE, for multiple linear regression and/or polynomial regression. Finally, we will investigate which criteria are most informative for a baseball team making the playoffs, the postseason, or a championship.
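As a minimal sketch of this selection workflow (using R's built-in mtcars data as a stand-in, since the baseball data is introduced below), step() performs a backward search under AIC by default, or under BIC when k = log(n), and the selected candidates can then be compared on adjusted R-squared:

```r
# backward model selection on a stand-in dataset (mtcars), not the baseball data
full_model = lm(mpg ~ ., data = mtcars)
# k = 2 (the default) penalizes by AIC; k = log(n) penalizes by BIC
mod_aic = step(full_model, direction = "backward", trace = 0)
mod_bic = step(full_model, direction = "backward", trace = 0,
               k = log(nrow(mtcars)))
# compare the selected models on adjusted r-squared
c(aic = summary(mod_aic)$adj.r.squared, bic = summary(mod_bic)$adj.r.squared)
```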
For this project we consider two sets of data:
1) MLB Statistics 1962-2012 In the early 2000s, Billy Beane and Paul DePodesta worked for the Oakland Athletics. With the help of statistical analysis, they changed the game of baseball drastically. They realized that statistics such as on-base percentage and slugging percentage are very important for predicting scoring runs, so they could project a player's performance and recruit him on a small budget before other team managers even considered him. This dataset contains some of the information that was available to Beane and DePodesta in the early 2000s: 1232 cases with 15 feature variables, as follows:
MLBData = read.csv("MLB.csv")
head(MLBData)
## Team League Year RS RA W OBP SLG BA Playoffs RankSeason
## 1 ARI NL 2012 734 688 81 0.328 0.418 0.259 0 NA
## 2 ATL NL 2012 700 600 94 0.320 0.389 0.247 1 4
## 3 BAL AL 2012 712 705 93 0.311 0.417 0.247 1 5
## 4 BOS AL 2012 734 806 69 0.315 0.415 0.260 0 NA
## 5 CHC NL 2012 613 759 61 0.302 0.378 0.240 0 NA
## 6 CHW AL 2012 748 676 85 0.318 0.422 0.255 0 NA
## RankPlayoffs G OOBP OSLG
## 1 NA 162 0.317 0.415
## 2 5 162 0.306 0.378
## 3 4 162 0.315 0.403
## 4 NA 162 0.331 0.428
## 5 NA 162 0.335 0.424
## 6 NA 162 0.319 0.405
2) Major league baseball player statistics data
This dataset, obtained from the Lahman Baseball Database, contains 4535 cases with 50 feature variables for a group of individual baseball players over the years 1959-2004.
LahmanData = read.csv("LahmanBaseballDatabase.csv")
head(LahmanData)
## YEAR YRINDEX PLAYERID NAMElast NAMEfirst TEAM LG LGCODE G AB R H HR
## 1 1960 2 aaronha01 Aaron Hank ML1 NL 2 153 590 102 172 40
## 2 1961 3 aaronha01 Aaron Hank ML1 NL 2 155 603 115 197 34
## 3 1962 4 aaronha01 Aaron Hank ML1 NL 2 156 592 127 191 45
## 4 1963 5 aaronha01 Aaron Hank ML1 NL 2 161 631 121 201 44
## 5 1964 6 aaronha01 Aaron Hank ML1 NL 2 145 570 103 187 24
## 6 1965 7 aaronha01 Aaron Hank ML1 NL 2 150 570 109 181 32
## RBI TB OB PA DBL TR SB CS BB SO IBB HBP SH SF GIDP AVG OBP SLG EXP
## 1 126 334 234 664 20 11 16 7 60 63 13 2 0 12 8 0.292 0.352 0.566 2
## 2 120 358 255 670 39 10 21 9 56 64 20 2 1 9 16 0.327 0.381 0.594 3
## 3 128 366 260 667 28 6 15 7 66 73 14 3 0 6 14 0.323 0.390 0.618 4
## 4 130 370 279 714 29 4 31 5 78 94 18 0 0 5 11 0.319 0.391 0.586 5
## 5 95 293 249 634 30 2 22 4 62 46 9 0 0 2 22 0.328 0.393 0.514 6
## 6 89 319 242 639 40 1 24 4 60 81 10 1 0 8 15 0.318 0.379 0.560 7
## PAYR MLAVG MLOBP MLSLG AVGcum OBPcum SLGcum ABcum Rcum Hcum HRcum RBIcum
## 1 598 0.265 0.335 0.407 0.324 0.377 0.602 1219 218 395 79 249
## 2 598 0.268 0.340 0.418 0.325 0.378 0.599 1822 333 592 113 369
## 3 598 0.268 0.337 0.412 0.324 0.381 0.604 2414 460 783 158 497
## 4 598 0.255 0.319 0.389 0.323 0.383 0.600 3045 581 984 202 627
## 5 598 0.261 0.324 0.396 0.324 0.385 0.587 3615 684 1171 226 722
## 6 598 0.255 0.322 0.389 0.323 0.384 0.583 4185 793 1352 258 811
## PAcum OBcum TBcum G_Lag1 AB_Lag1 R_Lag1 H_Lag1 HR_Lag1 RBI_Lag1 TB_Lag1
## 1 1357 512 734 154 629 116 223 39 123 400
## 2 2027 767 1092 153 590 102 172 40 126 334
## 3 2694 1027 1458 155 603 115 197 34 120 358
## 4 3408 1306 1828 156 592 127 191 45 128 366
## 5 4042 1555 2121 161 631 121 201 44 130 370
## 6 4681 1797 2440 145 570 103 187 24 95 293
## OB_Lag1 PA_Lag1 DBL_Lag1 TR_Lag1 SB_Lag1 CS_Lag1 BB_Lag1 SO_Lag1 IBB_Lag1
## 1 278 693 46 7 8 0 51 54 17
## 2 234 664 20 11 16 7 60 63 13
## 3 255 670 39 10 21 9 56 64 20
## 4 260 667 28 6 15 7 66 73 14
## 5 279 714 29 4 31 5 78 94 18
## 6 249 634 30 2 22 4 62 46 9
## HBP_Lag1 SH_Lag1 SF_Lag1 GIDP_Lag1 AVG_Lag1 OBP_Lag1 SLG_Lag1 EXP_Lag1
## 1 4 0 9 19 0.355 0.401 0.636 1
## 2 2 0 12 8 0.292 0.352 0.566 2
## 3 2 1 9 16 0.327 0.381 0.594 3
## 4 3 0 6 14 0.323 0.390 0.618 4
## 5 0 0 5 11 0.319 0.391 0.586 5
## 6 0 0 2 22 0.328 0.393 0.514 6
## AVGcum_Lag1 OBPcum_Lag1 SLGcum_Lag1 ABcum_Lag1 Rcum_Lag1 Hcum_Lag1 HRcum_Lag1
## 1 0.355 0.401 0.636 629 116 223 39
## 2 0.324 0.377 0.602 1219 218 395 79
## 3 0.325 0.378 0.599 1822 333 592 113
## 4 0.324 0.381 0.604 2414 460 783 158
## 5 0.323 0.383 0.600 3045 581 984 202
## 6 0.324 0.385 0.587 3615 684 1171 226
## RBIcum_Lag1 PAcum_Lag1 OBcum_Lag1 TBcum_Lag1
## 1 123 693 278 400
## 2 249 1357 512 734
## 3 369 2027 767 1092
## 4 497 2694 1027 1458
## 5 627 3408 1306 1828
## 6 722 4042 1555 2121
In this section we create the helper functions that will be used later to analyze the data. To keep things organized, all of the necessary libraries are also gathered and loaded in this section.
diagnostics() functions
# make_conf_mat() function--------------------------
### create function to generate the Confusion Matrix for a classifier
make_conf_mat = function(predicted, actual) {
  table(predicted = predicted, actual = actual)
}
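A quick sanity check of make_conf_mat() on hypothetical 0/1 labels (mirroring the Playoffs coding used later); the function is repeated here so the sketch is self-contained:

```r
make_conf_mat = function(predicted, actual) {
  table(predicted = predicted, actual = actual)
}
# hypothetical playoff predictions (1 = made playoffs) versus actual outcomes
pred = c(1, 0, 1, 1, 0)
act  = c(1, 0, 0, 1, 0)
make_conf_mat(predicted = pred, actual = act)
```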
# diagnostics() function--------------------------
### create function to generate fitted vs. residuals plot and normal qq-plot
diagnostics = function(model, pcol = "grey", lcol = "dodgerblue", alpha = 0.05,
                       plotit = TRUE, testit = TRUE) {
  if (plotit) {
    par(mfrow = c(1, 2))
    plot(fitted(model), resid(model), col = pcol, pch = 20,
         xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals")
    abline(h = 0, col = lcol, lwd = 2)
    qqnorm(resid(model), main = "Normal Q-Q Plot", col = pcol)
    qqline(resid(model), col = lcol, lwd = 2)
  }
  if (testit) {
    results = list(p_val = 0.0, decision = "Reject")
    results$p_val = shapiro.test(resid(model))$p.value
    if (results$p_val < alpha) {
      results$decision = "Reject"
    } else {
      results$decision = "Fail to Reject"
    }
    return(results)
  }
}
outlier_diagnostic() functions
# get_influence() function-----------------------------------
### create function to compute influence for all observations
get_influence = function(model) {
  influence = cooks.distance(model)
  return(influence)
}
# hi_influence_index() function-------------------------
### create function to index high influence observations
hi_influence_index = function(model) {
  # compute influence for all observations
  influence = get_influence(model)
  # index high influence observations
  hi_influence_index = influence > 4 / length(influence)
  return(hi_influence_index)
}
# outlier_count() function-------------------------------------
### create function to count the number of outliers for a model
outlier_count = function(hi_influence_index) {
  # count number of high influence observations
  outlier_count = sum(hi_influence_index)
  return(outlier_count)
}
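The 4/n heuristic encoded in the functions above can be exercised standalone; this sketch uses mtcars as a stand-in rather than the baseball data:

```r
# flag observations whose Cook's distance exceeds the 4 / n heuristic
fit = lm(mpg ~ wt + hp, data = mtcars)
cd  = cooks.distance(fit)
hi  = cd > 4 / length(cd)
# how many observations the rule flags, and which ones
sum(hi)
names(cd)[hi]
```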
# outlier_remove() function------------------------------------------
### create function to remove significant outliers from training data
### remove observations on the basis of influence (cook's distance) only
outlier_remove = function(hi_influence_index, train_set) {
  # subset train data to exclude high influence observations
  # (hi_influence_index is logical, so negate with `!` rather than `-`)
  clean_set = train_set[!hi_influence_index, ]
  return(clean_set)
}
# outlier_diagnostic() function----------------------------------------------------
### create function to package the functionalities of outlier_count / outlier_remove
### enriching/processing raw data
outlier_diagnostic = function(model, train_set = NULL) {
  # index high influence observations
  hi_influence_index = hi_influence_index(model)
  # count number of outliers
  outlier_count = outlier_count(hi_influence_index)
  # refit branch: also remove significant outliers from training data
  if (!is.null(train_set)) {
    clean_set = outlier_remove(hi_influence_index, train_set)
    # format output
    out = list(outlier_count = outlier_count, clean_set = clean_set)
  } else {
    # format output
    out = list(outlier_count = outlier_count)
  }
  return(out)
}
goodness_of_fit() functions
# get_loocv_rmse() function--------------
### create function to compute loocv rmse
get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
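The one-line shortcut above works because, for a linear model, the leave-one-out residual satisfies e_[i] = e_i / (1 - h_i), where h_i is the i-th leverage (hat value). A quick check against explicit leave-one-out refitting on mtcars confirms the two computations agree (the function is repeated so the sketch is self-contained):

```r
get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
fit = lm(mpg ~ wt, data = mtcars)

# explicit leave-one-out refitting, for comparison
loo_errors = sapply(seq_len(nrow(mtcars)), function(i) {
  refit = lm(mpg ~ wt, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(refit, newdata = mtcars[i, ])
})
all.equal(get_loocv_rmse(fit), sqrt(mean(loo_errors ^ 2)))  # TRUE
```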
# get_AIC() function----------------
### create function to calculate AIC
get_AIC = function(model) {
  AIC(model)
}
# get_BIC() function----------------
### create function to calculate BIC
get_BIC = function(model) {
  BIC(model)
}
# get_adj_r2() function--------------------------
### create function to extract adjusted r-squared
get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}
# rmse() function-----------------------------------------------
### create function to compute rmse given actual and fitted data
rmse = function(y_actual, y_hat, n) {
  e = y_actual - y_hat
  sse = sum(e ^ 2)
  rmse = sqrt(sse / n)
  return(rmse)
}
# get_test_rmse() function------------------
### create function to compute test set rmse
get_test_rmse = function(model, test_set, response) {
  # predict responses using the fitted model on test set predictor data
  y_hat = as.vector(predict(model, test_set))
  # store actual responses in vector
  y_actual = as.vector(test_set[, response])
  # extract number of observations
  n = length(y_actual)
  # compute rmse
  rmse(y_actual, y_hat, n)
}
# goodness_of_fit() function------------------------------
### create function to compute all goodness of fit measures
goodness_of_fit = function(model, test_set, response) {
  # get loocv rmse
  loocv_rmse = get_loocv_rmse(model)
  # get AIC
  AIC = get_AIC(model)
  # get BIC
  BIC = get_BIC(model)
  # get adjusted r-squared
  adj_r2 = get_adj_r2(model)
  # get test set rmse
  test_rmse = get_test_rmse(model, test_set, response)
  # format output
  out = list(loocv_rmse = loocv_rmse, AIC = AIC, BIC = BIC,
             adj_r2 = adj_r2, test_rmse = test_rmse)
  return(out)
}
test_collinearity() functions
# test_decision_VIF() function------------------------------------------------------
### create function to make a statistical decision on the basis of a given threshold
test_decision_VIF = function(p, threshold) {
  # decision branch
  if (isTRUE(p > threshold)) {
    # reject null
    return("Reject")
  } else {
    # fail to reject null
    return("Fail to Reject")
  }
}
# get_VIF() function------------------------------------------------------------------------
### create function to compute the variance inflation factors for all est. beta coefficients
### note: vif() is not in base R; here we assume the car package provides it
library(car)
get_VIF = function(model) {
  vif(model)
}
# test_collinearity() function---------------------------------------------------------
### create function to extract max VIF and determine whether collinearity is an issue
### "Reject" = collinearity is an issue, "Fail to Reject" = collinearity is not an issue
VIF_conventional_alpha = 5
test_collinearity = function(model, alpha = VIF_conventional_alpha) {
  # compute VIFs
  vif = get_VIF(model)
  # extract max VIF
  max_vif = max(abs(vif))
  # determine whether collinearity is an issue
  collinearity = test_decision_VIF(max_vif, alpha)
  # format output
  out = list(max_vif = max_vif, collinearity = collinearity)
  return(out)
}
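Since VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors, the values vif() reports can be reproduced without any extra package; a sketch on three mtcars predictors:

```r
# manual variance inflation factors: regress each predictor on the others,
# then take 1 / (1 - R^2) of that auxiliary regression
predictors = mtcars[, c("wt", "hp", "disp")]
manual_vif = sapply(names(predictors), function(j) {
  others = setdiff(names(predictors), j)
  r2 = summary(lm(reformulate(others, response = j), data = predictors))$r.squared
  1 / (1 - r2)
})
manual_vif
```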
resid_diagnostic() function
# test_decision() function------------------------------------------------------
### create function to make a statistical decision on the basis of a given alpha
test_decision = function(p, alpha) {
  # decision branch
  if (isTRUE(p < alpha)) {
    # reject null
    return("Reject")
  } else {
    # fail to reject null
    return("Fail to Reject")
  }
}
# test_normality() function----------------------------------------------------
### create function to perform test for the normality of fitted model residuals
### "Reject" = normality is suspect, "Fail to Reject" = normality is not suspect
test_normality = function(model) {
  test = shapiro.test(resid(model))
  p = test$p.value
  return(p)
}
# test_constantvar() function---------------------------------------------------------
### create function to perform test for the homoscedasticity of fitted model residuals
### "Reject" = homoscedasticity is suspect, "Fail to Reject" = homoscedasticity is not suspect
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
test_constantvar = function(model) {
  test = bptest(model)
  p = as.vector(test$p.value)
  return(p)
}
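By default bptest() computes the studentized (Koenker) form of the Breusch-Pagan statistic, which equals n * R^2 from regressing the squared residuals on the model's predictors. A hand-rolled sketch on mtcars (not the baseball data):

```r
# studentized Breusch-Pagan by hand: n * R^2 from the auxiliary regression
fit = lm(mpg ~ wt + hp, data = mtcars)
aux = lm(resid(fit) ^ 2 ~ wt + hp, data = mtcars)
bp_stat = nrow(mtcars) * summary(aux)$r.squared
# reference distribution is chi-squared with df = number of predictors
p_val = pchisq(bp_stat, df = 2, lower.tail = FALSE)
c(statistic = bp_stat, p.value = p_val)
```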
# test_diagnostic() function---------------------------------------------------------
### create function to package the functionality of test_normality / test_constantvar
test_diagnostic = function(model, alpha) {
  # perform Shapiro-Wilk normality test
  p_norm = test_normality(model = model)
  # determine statistical decision arrived at through S-W test
  decision_norm = test_decision(p = p_norm, alpha = alpha)
  # perform Breusch-Pagan homoscedasticity test
  p_convar = test_constantvar(model = model)
  # determine statistical decision arrived at through B-P test
  decision_convar = test_decision(p = p_convar, alpha = alpha)
  # format output
  out = list(p_norm = p_norm, decision_norm = decision_norm,
             p_convar = p_convar, decision_convar = decision_convar)
  return(out)
}
# resid_diagnostic() function---------------------------------------------------------
### create function to package the q-q plot, the fitted-versus-residuals plot,
### and test_diagnostic (the plots are drawn inline here, since no separate
### plot_qq / plot_fvr helpers are defined)
resid_conventional_alpha = 0.05
resid_diagnostic = function(model, pcol = "slategrey", lcol = "red",
                            alpha = resid_conventional_alpha,
                            plot_qq = FALSE, plot_fvr = FALSE, testit = FALSE) {
  # output scheme
  if (testit) {
    return(test_diagnostic(model, alpha))
  }
  if (plot_qq) {
    # normal q-q plot of the residuals
    qqnorm(resid(model), main = "Normal Q-Q Plot", col = pcol)
    qqline(resid(model), col = lcol, lwd = 2)
  }
  if (plot_fvr) {
    # fitted versus residuals plot
    plot(fitted(model), resid(model), col = pcol, pch = 20,
         xlab = "Fitted", ylab = "Residuals", main = "Fitted versus Residuals")
    abline(h = 0, col = lcol, lwd = 2)
  }
}
1) MLB Statistics 1962-2012
Some groupings of the dataset
Teams that made the playoffs
MLBData$RunDiff = MLBData$RS - MLBData$RA
playOff = MLBData[which(!is.na(MLBData$RankPlayoffs)), ]
notPlayOff = MLBData[which(is.na(MLBData$RankPlayoffs)), ]
As mentioned, the MLB dataset has 15 feature variables; to create a predictive model for this project we considered 9 of them, as follows:
To get insight into the MLB dataset and its feature variables, let us first take a look at the distribution of each variable:
pairs(MLBData, col="blue")